1 Summary

Exploratory data analysis of a healthcare-fraud dataset from Kaggle is presented.

Section 2 describes how clues in the dataset yield informed guesses about its scope. Terminology and data integrity are also discussed.

The exploratory analysis presented in Sec. 3 has two goals:

A summary of fraud associations is presented in Sec. 4.

2 Dataset

Description

The dataset contains training data and test data, each with four csv files.

The training data was used for the analysis. Details of the csv files:

  • Inpatient visits to a hospital (30 columns, ~ 40,000 rows)
  • Outpatient visits to a hospital (27 columns, ~ 520,000 rows)
  • The patients who made these hospital visits (25 columns, ~ 140,000 rows)
  • The hospitals visited, referred to as providers (2 columns: provider ID and PotentialFraud label, ~ 5,400 rows)

The patients in the dataset appear to be Medicare patients residing in the US and Africa.

While 9.4% of the providers in the training data have the “potential fraud” label, 38% of the claims were submitted by these providers. Even more striking, 58% of the inpatient claims were submitted by “potential fraud” providers.
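The two kinds of percentage quoted above (share of providers vs. share of claims) can be computed along the following lines. This is a minimal sketch on made-up rows, written in base R so it runs without other packages; the `'Yes'`/`'No'` coding of `PotentialFraud` and the column name `Provider` are assumptions rather than details stated in this report.

```r
# Toy stand-ins for the provider table and the combined claims table.
providers <- data.frame(
    Provider = c('P1', 'P2', 'P3', 'P4'),
    PotentialFraud = c('Yes', 'No', 'No', 'No'),  # assumed coding
    stringsAsFactors = FALSE
)
claims <- data.frame(
    ClaimID = 1:6,
    Provider = c('P1', 'P1', 'P1', 'P2', 'P3', 'P4'),
    stringsAsFactors = FALSE
)

# Share of providers that carry the label.
provider_share <- mean(providers$PotentialFraud == 'Yes')

# Share of claims submitted by labeled providers.
flagged <- providers$Provider[providers$PotentialFraud == 'Yes']
claim_share <- mean(claims$Provider %in% flagged)
```

In this toy example one provider in four is flagged, but that provider submits half of the claims, which is the same kind of imbalance described above.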

Scope

The hospital visits in the dataset span all of 2009, and nearly all of them fall within that year.

Multiple clues in the dataset indicate that the hospital patients are Medicare patients. Medicare covers patients who are 65 or older, certain younger people with disabilities, and people with end-stage renal disease (permanent kidney failure).

  • The first clue leading to this conclusion is that for nearly all of the inpatient visits, the deductible paid by the patient was $1068, the 2009 deductible for an inpatient visit covered by Medicare. (About 2% of the inpatient visits had NA for the deductible paid, but for all other inpatient visits the deductible paid was $1068.)
  • Consistent with the special treatment of end-stage renal disease by Medicare, the table of patient info includes a boolean feature RenalDiseaseIndicator. This column is among other columns that identify a patient, such as birthdate, gender and race.
  • The column names NoOfMonths_PartACov and NoOfMonths_PartBCov in the table of patients can also be understood in this context: these columns likely give the number of months in 2009 for which the patient had coverage by Medicare Part A and Medicare Part B, respectively. (Each of these columns contains integer values from 0 to 12.)
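A check like the one behind the first clue could look like the sketch below, which tabulates the deductible while keeping NA as its own category. The rows are made up; only the column name `DeductibleAmtPaid` comes from the dataset.

```r
# Toy inpatient table; real values would come from the inpatient csv file.
inpatient <- data.frame(
    ClaimID = 1:5,
    DeductibleAmtPaid = c(1068, 1068, NA, 1068, 1068)
)

# Tabulate the deductible, keeping NA as a separate category.
deductible_counts <- table(inpatient$DeductibleAmtPaid, useNA = 'ifany')
print(deductible_counts)

# Fraction of visits with a missing deductible.
na_fraction <- mean(is.na(inpatient$DeductibleAmtPaid))
```

If the table shows a single non-NA value, as it does here, the deductible is constant across all visits that report it.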

The patients appear to be residents of the US and Africa.

  • Each patient is identified with a state and a county, both labeled by integers (referred to here as codes). There are 52 different state codes ranging from 1 to 54. The county codes are three-digit integers ranging from 0 to 999.
  • The state and county codes appear to be components of the 5-digit SSA (Social Security Administration) codes that identify patient location. For example, the dataset includes 156 patients with state code 49 and county code 430, and these patients probably reside in Henrico County, Virginia (SSA code 49430).
  • Several consistency checks support this conclusion. For example, nearly all of the ~ 3,000 combinations of state code and county code in the dataset correspond to locations specified by a 5-digit SSA code. Five US states (AK, CA, IL, NH, VA) in the dataset included county codes that I could not identify from the SSA code data published by the National Bureau of Economic Research.
  • For simplicity, I will use “SSA state code” to refer to the first two digits of a 5-digit SSA code that identifies patient location. Using this terminology, SSA state codes 1 to 53 correspond to the 50 US states, together with the District of Columbia, Puerto Rico, and the US Virgin Islands. For example, state code 53 corresponds to Wyoming.
  • SSA state codes above 53 are also defined: for instance, 54 and 55 correspond to Africa and Asia, respectively.
  • The SSA state codes in the dataset correspond to the 50 US states, together with the District of Columbia and Africa.
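The way a state code and a county code combine into a 5-digit SSA code, and the consistency check against a reference list, can be sketched as follows. The function name `make_ssa_code` and the tiny `reference_codes` vector are hypothetical; a real check would use a full reference table of SSA codes.

```r
# Build a 5-digit SSA code from integer state and county codes.
# Zero-padding keeps, e.g., state 1 as '01' and county 0 as '000'.
make_ssa_code <- function(state, county) {
    sprintf('%02d%03d', as.integer(state), as.integer(county))
}

henrico <- make_ssa_code(49, 430)  # the Henrico County example above

# Consistency check: which codes built from patient data are missing from
# a (hypothetical, heavily truncated) reference list of valid SSA codes?
reference_codes <- c('49430', '01000')
patient_codes <- make_ssa_code(c(49, 49, 1), c(430, 999, 0))
unmatched <- setdiff(patient_codes, reference_codes)
```

Here `unmatched` holds the one code (state 49, county 999) that has no entry in the reference list, which is the kind of case reported above for five states.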

The number of healthcare providers in the dataset (including both training and test data) is 6,763. This is somewhat larger than the number of hospitals in the US in 2009 (5,795).

  • Is this apparent discrepancy due to the inclusion of Medicare patients living in Africa? The number of providers in the dataset visited only by patients residing in Africa is 55, so this is probably not the explanation.
  • The 2009 count of US hospitals originated from the American Hospital Association (AHA). Although the description of the healthcare-fraud dataset on Kaggle refers to “hospital visits,” it may be that not all healthcare providers visited met the AHA definition of a hospital.

The dataset includes information for ~ 690,000 visits to a hospital (or similar healthcare provider), including ~ 50,000 inpatient visits and ~ 640,000 outpatient visits. By way of comparison, the number of inpatient hospital visits in the US in 2009 for which Medicare was the expected primary payer was ~ 15 million.

The number of patients in the dataset is ~ 150,000, compared to ~ 46 million people enrolled for Medicare coverage in 2009.

Within the Medicare system, payments for doctor services are distinct from payments to hospitals. The dataset does not appear to include information about payments to doctors.

  • For both inpatient and outpatient visits covered by Medicare, patients pay 20% of the cost of doctor services.
  • In the dataset, the information about payments is contained in the features InscClaimAmtReimbursed and DeductibleAmtPaid. Payments made by patients for doctor services are distinct from these two categories.
  • In principle the payments made by patients for doctor services could be lumped in with the feature DeductibleAmtPaid, but clearly this was not done for inpatient visits, since all claims have $1068 or NA for this feature. (As noted above, $1068 is the 2009 deductible for an inpatient visit covered by Medicare.)
  • Note that a “potential fraud” label is given for providers but not for doctors. This focus on provider fraud may explain the apparent lack of information in the dataset about payments for doctor services.

Terminology

An outpatient hospital visit is often defined as a visit for which it is not necessary to stay overnight at the hospital.

However, Medicare defines an outpatient as one who has not formally been admitted to a hospital with a doctor’s order. A Medicare publication explains that the decision to formally admit a patient is “a complex medical decision,” and even a stay of two or more nights at a hospital does not guarantee that a patient will be an inpatient.

  • For example, staying overnight in connection with a visit to the emergency room would not necessarily be an inpatient visit. Emergency-department services are outpatient services, as are observation services, which help a doctor decide whether a patient needs to be formally admitted.

The question of whether to formally admit a patient can be financially significant for the patient, because inpatient and outpatient visits are covered by different parts of the Medicare program (Part A and Part B, respectively).

  • Medicare patients pay a one-time deductible for the set of inpatient visits made during a year.
  • However, in general a copayment must be made for each outpatient visit, and the set of copayments made by a patient during a year can add up to more than the one-time deductible for inpatient visits. (In the table for outpatient visits, the feature DeductibleAmtPaid apparently gives patient copayments.)

Data integrity

The dataset appears clean, and it passed several checks for consistency among the data frames.

There is an apparent inconsistency in some outpatient visits: the two features that report payments for a visit (DeductibleAmtPaid, InscClaimAmtReimbursed) are both equal to zero in some cases.

  • This is true for about 4% of the outpatient visits in the training data, for instance.
  • Does zero payment indicate that payment was refused because fraud was suspected? Apparently not: the share of zero-payment outpatient claims that come from providers labeled as “potential fraud” (36.1%) is quite close to the share of all outpatient claims in the training set that come from such providers (36.6%).
  • A possible explanation is that the dataset includes claims submitted to Medicare that were eventually covered by other insurers.
  • Another possibility is that in practice, hospitals receive zero payment for some outpatient visits, e.g., because the costs associated with collecting payment are too great to justify pursuing payment.
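The comparison in the second bullet can be sketched as below, under the reading that both percentages are shares of claims coming from labeled providers (once among zero-payment claims, once among all outpatient claims). The rows are made up, and the `'Yes'`/`'No'` coding of `PotentialFraud` is an assumption.

```r
# Toy outpatient claims with the provider label already attached.
outpatient <- data.frame(
    InscClaimAmtReimbursed = c(0, 100, 0, 50, 0, 200, 80, 0),
    DeductibleAmtPaid      = c(0,   0, 0, 10, 0,   0, 20, 0),
    PotentialFraud = c('Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes'),
    stringsAsFactors = FALSE
)

# A claim counts as 'zero payment' when both payment features are zero.
zero_payment <- outpatient$InscClaimAmtReimbursed == 0 &
    outpatient$DeductibleAmtPaid == 0
is_fraud <- outpatient$PotentialFraud == 'Yes'

fraud_share_zero <- mean(is_fraud[zero_payment])  # among zero-payment claims
fraud_share_all <- mean(is_fraud)                 # among all claims
```

When the two shares are close, as in this toy example, zero payment carries little information about the label.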

In most (or possibly all) cases, NA in the csv files can be viewed as an artifact of the way in which the data is presented, rather than an indication that a valid data point is missing.

Columns with NA:

  1. The tables for inpatient and outpatient visits include a number of columns in which codes relevant for the visit can be reported. For example, there are 10 columns ClmDiagnosisCode_1 through ClmDiagnosisCode_10 for claim diagnosis codes, and there are many NA in these columns.

    • Most NA in these columns can be viewed as an artifact of the table organization.
    • About 1.5% of outpatient visits (but no inpatient visits) in the dataset have NA for all code columns.
    • Is it plausible that these outpatient Medicare claims had no codes when submitted, or is it more likely that the submitted codes are missing from the dataset? Further research would be needed to answer this question.
  2. Similarly, three columns are reserved for physicians associated with the visit: AttendingPhysician, OperatingPhysician, OtherPhysician.

    • Most NA in these columns can also be viewed as an artifact of the table organization.
    • There are 135 inpatient visits and 1688 outpatient visits with NA in all three of these columns.
    • Possible research questions:
      • Is an outpatient hospital visit necessarily associated with a physician? For instance, a nurse practitioner may be able to fill in for a physician for certain outpatient visits.
      • Do certain combinations of codes tend to lack a physician?
  3. For about 2% of the inpatient visits (but not for any outpatient visits), DeductibleAmtPaid is NA. As noted in the section on dataset scope, the deductible paid by inpatients was $1068 in all cases where it was not NA.

    • These NA probably correspond to visits for which the deductible payment was zero, because the patient had already paid the one-time deductible for the year.
    • However, there are samples that have NA for DeductibleAmtPaid with InscClaimAmtReimbursed equal to zero. For instance, there are 25 rows of this sort in the training data.
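Rows with NA in every column of a group, such as the claims with no codes at all in item 1, can be found with a row-wise count of non-missing values. The sketch below uses three diagnosis-code columns and made-up values; the real tables have more code columns, which would simply be added to `code_columns`.

```r
# Toy claims table with three diagnosis-code columns.
claims <- data.frame(
    ClaimID = 1:4,
    ClmDiagnosisCode_1 = c('4019', NA, '2724', NA),
    ClmDiagnosisCode_2 = c(NA, NA, '4019', '25000'),
    ClmDiagnosisCode_3 = c(NA, NA, NA, NA),
    stringsAsFactors = FALSE
)

# Select the code columns by name prefix.
code_columns <- grep('^ClmDiagnosisCode_', names(claims), value = TRUE)

# A row has no codes when none of its code columns is non-missing.
no_codes <- rowSums(!is.na(claims[, code_columns])) == 0

fraction_without_codes <- mean(no_codes)
claims_without_codes <- claims$ClaimID[no_codes]
```

The same pattern applies to the three physician columns in item 2, with `code_columns` replaced by those column names.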

Training data summary

Healthcare providers

  • Number of providers: 5,410
  • Number of providers giving inpatient care: 2,092
  • Number of providers giving outpatient care: 5,012
  • Number of providers giving both inpatient and outpatient care: 1,694
  • 9.4% of providers have the “potential fraud” label.

Hospital visits

  • Number of hospital visits: 558,211
  • Number of inpatient visits: 40,474
  • Number of outpatient visits: 517,737

Each hospital visit is associated with a claim made by a provider. For purposes of exploratory data analysis, claims made by a provider that has been flagged for potential fraud also receive the “potential fraud” label.
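Passing the provider label down to claims amounts to a lookup on the provider ID. A minimal base-R sketch with made-up rows follows; the column name `Provider` and the `'Yes'`/`'No'` coding are assumptions.

```r
providers <- data.frame(
    Provider = c('P1', 'P2'),
    PotentialFraud = c('Yes', 'No'),
    stringsAsFactors = FALSE
)
claims <- data.frame(
    ClaimID = c('C1', 'C2', 'C3'),
    Provider = c('P2', 'P1', 'P2'),
    stringsAsFactors = FALSE
)

# Look up each claim's provider and copy that provider's label to the claim.
claims$PotentialFraud <- providers$PotentialFraud[
    match(claims$Provider, providers$Provider)
]
```

A join (for example `merge()` or `dplyr::left_join()`) would give the same result; `match()` is used here only to keep the row order of `claims` explicit.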

Patients

  • Number of patients: 138,556
  • Number of inpatients: 31,289
  • Number of outpatients: 133,980
  • Number of patients receiving both inpatient and outpatient care: 26,713

Doctors

  • Number of doctors: 100,737
  • Number of inpatient doctors: 18,256
  • Number of outpatient doctors: 89,770
  • Number of doctors giving both inpatient and outpatient care: 7,289

3 Analysis

3.1 Chronic conditions

Summary

Since the dataset appears to represent a set of visits covered by Medicare, it is expected that the patients are mainly elderly and that many have chronic health problems. However, the frequency of some chronic conditions among these patients is quite high, and most patients have multiple chronic conditions.

Chronic conditions are widespread even among the Medicare patients who are not elderly. In the plot below, the patients under the age of 65 are disabled or have end-stage renal disease (permanent kidney failure).

The chronic conditions are widespread among the Medicare patients in all 50 US states as well as in Africa.

Number of conditions per patient

for (claim_type in claim_types) {
    plot_number_of_conditions(patients, claim_type)
}

Number of patients with condition

plot_chronic_count(patients, 'inpatient')
plot_chronic_count(patients, 'outpatient')

Percentage of patients with condition

for (variable_name in chronic_conditions) {
    plot_chronic_percent(patients_concatenated, variable_name)
}

Patient age

for (claim_type in claim_types) {
    plot_patient_age(patients, claim_type)
}

for (variable_name in chronic_conditions) {
    plot_chronic_by_age(patients, variable_name)
}

Visits per patient

for (claim_type in claim_types) {
    to_plot <- patients[[claim_type]]
    x_axis_values <- seq(min(to_plot$claim_count),
                         max(to_plot$claim_count))
    title <- str_c('Number of visits made by ', claim_type, 's')
    fig <- to_plot %>%
        ggplot(aes(x = claim_count)) +
        geom_bar(fill = 'navyblue') +
        scale_x_discrete('Number of visits', limits = factor(x_axis_values)) +
        ylab('Number of patients') +
        ggtitle(title)
    print(fig)
}

for (variable_name in chronic_conditions) {
    plot_visits_per_patient(patients, variable_name)
}

Geographic location

for (variable_name in chronic_conditions) {
    plot_chronic_by_location(patients, variable_name)
}

3.2 Reason for hospital visit

The data on hospital visits includes a column ClmAdmitDiagnosisCode that gives the diagnosis at the time the patient was admitted. This code can be roughly identified as the reason for the hospital visit.

The top five reasons for inpatient visits:

  1. Chest pain
  2. Shortness of breath
  3. Pneumonia
  4. Congestive heart failure (insufficient blood flow)
  5. Syncope (fainting or “passing out”)

The top five reasons for outpatient visits:

  1. Mammogram
  2. Atrial fibrillation (irregular heartbeat)
  3. High blood pressure
  4. Diabetes
  5. Monitoring of therapeutic drugs
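Rankings like the two above come from counting how often each value of ClmAdmitDiagnosisCode occurs. The sketch below shows the counting step on placeholder values (not real diagnosis codes); translating codes into descriptions requires a separate lookup table that is not shown here.

```r
# Placeholder admit-diagnosis codes, including one missing value.
admit_codes <- c('A', 'B', 'A', 'C', 'A', 'B', NA, 'D')

# Count each code (NA is dropped by default), most frequent first.
code_counts <- sort(table(admit_codes), decreasing = TRUE)

# Keep the most frequent codes; the report lists the top five.
top_codes <- head(code_counts, 2)
top_code_names <- names(top_codes)
```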

3.3 Fraud associations for patients and doctors

Summary

For simplicity, providers that have the “potential fraud” label, as well as claims submitted by these providers, will be labeled as “fraudulent” in the remainder of the presentation.

The plots below show that for inpatient claims, both doctors and patients can be roughly classified into two groups: those with no fraudulent claims and those with only fraudulent claims.

  • For outpatient claims, the same pattern can be seen among doctors, but the pattern is less pronounced among patients.

The first plot below also shows, however, that a small but still significant fraction of inpatients has 50% fraudulent claims.

  • There may be distinct fraud methods associated with the different parts of the distributions shown in the tabbed subsections.

Note that the second plot below shows that in the category of inpatients having only a single claim, the median percentage of fraudulent claims is 100%. (About 57% of the inpatients with only a single claim are associated with a fraudulent provider.)

The rough separation of doctors into two groups can also be seen in the scatter plot below. Doctors with a sufficiently large number of claims are exclusively in the “only fraudulent claims” group.
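The per-patient percentage of fraudulent claims, and the reason a 57% share yields a 100% median among single-claim patients, can be illustrated as follows. The rows are made up, and the `'Yes'`/`'No'` coding of `PotentialFraud` is an assumption.

```r
claims <- data.frame(
    BeneID = c('B1', 'B1', 'B2', 'B3', 'B3', 'B3', 'B4', 'B5'),
    PotentialFraud = c('Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No'),
    stringsAsFactors = FALSE
)

# Fraction of each patient's claims that come from a labeled provider.
claim_fraud_fraction <- tapply(claims$PotentialFraud == 'Yes',
                               claims$BeneID, mean)
claim_count <- tapply(claims$BeneID, claims$BeneID, length)

# For single-claim patients the fraction is either 0 or 1, so the median is
# 1 (i.e. 100%) as soon as more than half of them have a fraudulent claim.
single_claim <- claim_fraud_fraction[claim_count == 1]
median_single <- median(single_claim)
share_single <- mean(single_claim == 1)
```

In this toy example two of the three single-claim patients have a fraudulent claim, so the share is about 67% while the median is 100%.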

Providers per patient

for (claim_type in claim_types) {
    plot_fraud_per_individual(
        patients,
        count_type = 'provider',
        individual_type = 'patient',
        claim_type = claim_type
    )
}

Claims per patient

for (claim_type in claim_types) {
    plot_fraud_per_individual(
        patients,
        count_type = 'claim',
        individual_type = 'patient',
        claim_type = claim_type
    )
}

Providers per doctor

for (claim_type in claim_types) {
    plot_fraud_per_individual(
        doctors,
        count_type = 'provider',
        individual_type = 'doctor',
        claim_type = claim_type
    )
}

Claims per doctor

Note that for bar charts in which the number of claims per doctor is shown on the horizontal axis, this axis has been truncated, because a few doctors have a very large number of claims. The true range can be seen from the scatter plots of fraudulent claims per doctor vs total claims per doctor.

for (claim_type in claim_types) {
    plot_fraud_vs_legit(doctors, claim_type)
    plot_fraud_per_individual(
        doctors,
        count_type = 'claim',
        individual_type = 'doctor',
        claim_type = claim_type,
        count_limit = 30.5
    )
}

3.4 Provider claim counts

Summary

There is a strong association between fraud and the number of claims submitted by a provider.

The plot below illustrates this association for inpatient claims. In this dataset, only fraudulent providers submit more than about 100 inpatient claims.
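A threshold statement like the one above can be checked by counting claims per provider and taking the largest count among providers without the label. The sketch below uses made-up rows; how the report's own `provider_claim_counts` table was built is not shown in the text, so this is only one plausible construction.

```r
claims <- data.frame(
    Provider = c('P1', 'P1', 'P1', 'P2', 'P3', 'P3'),
    PotentialFraud = c('Yes', 'Yes', 'Yes', 'No', 'No', 'No'),
    stringsAsFactors = FALSE
)

# One row per provider, with its label and its number of claims.
provider_claim_counts <- aggregate(
    list(claim_count = claims$Provider),
    by = list(Provider = claims$Provider,
              PotentialFraud = claims$PotentialFraud),
    FUN = length
)

# Largest claim count among providers without the label; any provider
# above this value necessarily carries the label.
unlabeled <- provider_claim_counts$PotentialFraud == 'No'
max_unlabeled <- max(provider_claim_counts$claim_count[unlabeled])
```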

For 2009

plot_provider_claim_counts(provider_claim_counts, 'inpatient',
                           include_months = FALSE)
plot_provider_claim_counts(provider_claim_counts, 'outpatient',
                           include_months = FALSE)

The maximum number of inpatient claims for a provider in 2009 is 502.

The maximum number of outpatient claims for a provider in 2009 is 8208.

By month

Is there a trend in the number of monthly claims submitted by providers during the year?

The plots below show that there is a trend, but this trend does not have a significant association with fraud.

plot_provider_claim_counts(provider_claim_counts, 'inpatient',
                           include_year = FALSE)
plot_provider_claim_counts(provider_claim_counts, 'outpatient',
                           include_year = FALSE)

3.5 Payments

Summary

The cost of hospital visits doesn’t show a strong association with fraud.

For multiple measures of payment per patient, a scatter plot of payment vs the patient’s percentage of fraudulent claims shows a symmetric pattern. This association is potentially significant, although it is difficult to interpret.

Visit cost

for (claim_type in claim_types) {
    title <- str_c('Visit cost for ', claim_type, 's')
    fig_base <- claims[[claim_type]] %>%
        ggplot(aes(x = visit_cost)) +
        log_scale_dollar('Visit cost', 'x') +
        ggtitle(title)
    plot_histograms(fig_base, y_label = 'visits', bins = 20)
}

Payments by insurer / patient

for (claim_type in claim_types) {
    to_plot <- claims[[claim_type]] %>%
        select(visit_cost, InscClaimAmtReimbursed,
               DeductibleAmtPaid, PotentialFraud) %>%
        filter(visit_cost != 0) %>%
        mutate(percent_covered = InscClaimAmtReimbursed / visit_cost)

    title <- str_c('Insurer payments, ', claim_type, ' visits')
    plot_payment_distribution(to_plot, InscClaimAmtReimbursed, title)

    title <- str_c('Deductible, ', claim_type, ' visits')
    plot_payment_distribution(to_plot, DeductibleAmtPaid, title)

    title <- str_c('Insurer payment vs deductible, ', claim_type, ' visits')
    fig <- to_plot %>%
        ggplot(aes(x = DeductibleAmtPaid, y = InscClaimAmtReimbursed)) +
        geom_point(aes(color = PotentialFraud)) +
        log_scale_dollar('Deductible', 'x') +
        log_scale_dollar('Payment by insurer', 'y') +
        ggtitle(title)
    print(fig)

    title <- str_c('Percent covered by insurer, ', claim_type, ' visits')
    fig_base <- to_plot %>%
        ggplot(aes(percent_covered)) +
        scale_x_continuous('Percent covered by insurer',
                           labels = percent_format()) +
        ggtitle(title)
    plot_histograms(fig_base, y_label = 'visits', bins = 30)
}

Payments per patient

for (claim_type in claim_types) {
    to_plot <- patients[[claim_type]] %>%
        select(all_of(payment_variables), claim_fraud_fraction) %>%
        # Patients for which the only visit has visit_cost == 0 show up
        # with missing values for the columns in payment_variables.
        drop_na()

    for (variable_name in payment_variables) {
        title <- str_c(payment_labels[variable_name], ', ',
                       claim_type, 's')
        fig <- to_plot %>%
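            # .data[[...]] replaces the deprecated aes_string() for
            # column names held in a character variable.
            ggplot(aes(x = .data[[variable_name]])) +
            geom_point(aes(y = claim_fraud_fraction),
                       color = 'navyblue') +
            log_scale_dollar(axis_label = 'Payment amount', axis = 'x') +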
            scale_y_continuous('Patient percentage fraudulent claims',
                               labels = label_percent()) +
            ggtitle(title)
        print(fig)
    }
}

By chronic condition

Patient payments do not show a strong dependence on individual chronic conditions that a patient has, possibly because many patients have multiple chronic conditions.

plot_payments_by_chronic(patients)

3.6 Visit / claim duration

Inpatient visit duration

Summary

“Inpatient visit duration” is used here to mean the time between an inpatient’s admission and discharge dates.

Plots in the tabbed subsections show that inpatient visit duration is affected by patient age and by the number of chronic conditions a patient has. But there isn’t a strong association between fraud and visit duration.
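The definition of visit duration translates into a date difference. In the sketch below the column names `AdmissionDt` and `DischargeDt` are assumed, and the rows are made up; whether the original analysis also counts the admission day itself is not stated, so no extra day is added here.

```r
# Toy inpatient claims with admission and discharge dates as text.
inpatient_claims <- data.frame(
    AdmissionDt = c('2009-04-12', '2009-08-31', '2009-12-30'),
    DischargeDt = c('2009-04-18', '2009-09-02', '2009-12-30'),
    stringsAsFactors = FALSE
)

# Subtracting Date objects gives a difference in days; convert it to integer.
inpatient_claims$visit_duration <- as.integer(
    as.Date(inpatient_claims$DischargeDt) -
        as.Date(inpatient_claims$AdmissionDt)
)
```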

Distribution

The 36-day upper bound on visit durations is striking. I didn’t find any indication that 36 days is a significant threshold for inpatient visits covered by Medicare.

  • This may be an artifact of the (unknown) process by which data was selected for inclusion in the dataset. However, further research into this question could also be worthwhile.

title <- 'Distribution of inpatient visit duration'
fig_base <- inpatient_claims %>%
    ggplot(aes(x = visit_duration)) +
    xlab('Duration (days)') +
    ggtitle(title)
plot_bar_charts(fig_base, 'visits')

Patient age

fig <- inpatient_claims %>%
    ggplot(aes(x = patient_age, y = visit_duration)) +
    geom_point(aes(color = PotentialFraud)) +
    scale_x_continuous('Patient age (years)') +
    scale_y_continuous('Visit duration (days)') +
    ggtitle('Inpatient visit duration vs patient age')
print(fig)

Number of chronic conditions

to_plot <- inpatient_claims %>%
    select(BeneID, visit_duration) %>%
    left_join(
        select(inpatients, BeneID, all_of(chronic_conditions)),
        by = 'BeneID'
    ) %>%
    mutate(across(all_of(chronic_conditions), ~ (.) == 'Y'))
to_plot$number_of_conditions <- factor(rowSums(to_plot[, chronic_conditions]))
fig <- to_plot %>%
    ggplot(aes(x = number_of_conditions, y = visit_duration)) +
    geom_boxplot(aes(fill = number_of_conditions)) +
    xlab('Number of chronic conditions') +
    scale_y_continuous('Visit duration (days)') +
    guides(fill = 'none') +
    ggtitle('Inpatient visit duration vs number of chronic conditions')
print(fig)

Claim duration

Summary

As discussed in the tabbed subsection Terminology of Sec. 2 above, patients can stay at a hospital for multiple nights without being formally admitted as an inpatient.

The “claim duration”, given by (ClaimEndDt - ClaimStartDt), appears to be the duration of the patient’s stay in the hospital, including any period in which the patient was not formally admitted as an inpatient.

  • For most inpatient claims, the claim duration is identical to the visit duration.
  • The mean claim duration for outpatients is 2.4 days.
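The claim duration and its comparison with the visit duration can be sketched as follows. The rows and the `visit_duration` values are made up for illustration.

```r
claims <- data.frame(
    ClaimStartDt = as.Date(c('2009-04-12', '2009-08-30', '2009-12-30')),
    ClaimEndDt   = as.Date(c('2009-04-18', '2009-09-02', '2009-12-30')),
    visit_duration = c(6L, 2L, 0L)  # toy values
)

# Claim duration in whole days.
claims$claim_duration <- as.integer(claims$ClaimEndDt - claims$ClaimStartDt)

# Share of claims whose claim duration equals the visit duration.
share_identical <- mean(claims$claim_duration == claims$visit_duration)

# Mean claim duration in days.
mean_claim_duration <- mean(claims$claim_duration)
```

In the second toy row the claim starts one day before the (assumed) admission, so the claim duration exceeds the visit duration by one day.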

Like visit duration, claim duration does not have a strong association with fraud.

The distribution of visit cost per day is nearly identical for fraudulent and legitimate claims, as illustrated in the plots below.

  • The underlying structure responsible for this similarity may be widespread duplication of legitimate claims.

Distribution

Note that for both inpatient and outpatient visits, the claims with maximum value of claim duration are all fraudulent. Close inspection of the plots, however, shows that these “all fraud” bars correspond to a very small number of claims.

  • For example, there are a significant number of inpatient visits with claim duration of 36 days. But the bar showing 100% fraud is for claim duration of 37 days, which includes just 2 inpatient claims.

for (claim_type in claim_types) {
    title <- str_c('Distribution of ', claim_type, ' claim duration')
    fig_base <- claims[[claim_type]] %>%
        ggplot(aes(x = claim_duration)) +
        xlab('Duration (days)') +
        ggtitle(title)
    plot_bar_charts(fig_base, 'visits')
}

Cost / cost per day

for (claim_type in claim_types) {
    to_plot <- claims[[claim_type]] %>%
        filter(visit_cost != 0) %>%
        select(PotentialFraud, visit_cost, claim_duration, cost_per_claim_day)

    title <- str_c('Claim amount vs claim duration, ', claim_type, ' visits')
    fig_base <- to_plot %>%
        ggplot(aes(x = claim_duration, y = visit_cost)) +
        scale_x_continuous('Claim duration (days)')
    plot_cost_vs_duration(fig_base, title)

    plot_cost_per_day(to_plot, variable_name = 'cost_per_claim_day',
                      claim_type = claim_type, duration_label = 'claim')
}

3.7 Duplicated claims

Summary

It is natural to suspect that fraudulent claims are duplicates of legitimate claims in some sense, particularly given that the distribution of cost per day for fraudulent and legitimate claims is essentially identical.

This section explores the possibility that an important method of duplication is to copy all diagnosis and procedure codes associated with a claim.

The first plot below shows that only a tiny fraction of inpatient claims are duplicated in this way, so further research is needed into methods used to duplicate claims.

Even though only a small number of inpatient claims are duplicated in this way, analysis of identical groups of claims reveals a category containing only fraudulent claims.

Number of duplicates

plot_duplicate_counts(claims)

Claims in identical groups

Consider only the claims for which there is at least one duplicate, and collect these into groups of identical claims. In these groups, how many claims are there per provider?

The second plot below shows that for inpatient claims, the group with two claims per provider is 100% fraud.
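The grouping of identical claims can be sketched as below. This is an illustration on made-up rows, not the report's own implementation: only two diagnosis-code columns are used (procedure-code columns would be added to `code_columns` in the same way), and the definition of `identical_claims_per_provider` as group size divided by the number of distinct providers in the group is an assumption, chosen because it allows the non-integer values seen on the outpatient axis.

```r
claims <- data.frame(
    ClaimID  = 1:5,
    Provider = c('P1', 'P1', 'P2', 'P3', 'P3'),
    ClmDiagnosisCode_1 = c('4019', '4019', '4019', '2724', '486'),
    ClmDiagnosisCode_2 = c('25000', '25000', '25000', NA, NA),
    stringsAsFactors = FALSE
)
code_columns <- grep('^ClmDiagnosisCode_', names(claims), value = TRUE)

# One key per claim: all code columns pasted together (NA becomes 'NA').
claim_key <- do.call(paste, c(claims[code_columns], sep = '|'))

# Size of the group of identical claims that each claim belongs to.
claims$identical_claim_count <- as.integer(
    ave(seq_along(claim_key), claim_key, FUN = length)
)

# Number of distinct providers in each group of identical claims.
provider_index <- as.integer(factor(claims$Provider))
providers_in_group <- ave(provider_index, claim_key,
                          FUN = function(p) length(unique(p)))

# Assumed definition: claims in the group per distinct provider.
claims$identical_claims_per_provider <-
    claims$identical_claim_count / providers_in_group
```

Claims with no codes at all would all share one key under this scheme, which is presumably why the code below removes them first with `filter_out_no_codes()`.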

to_plot <- filter_out_no_codes(inpatient_claims) %>%
    filter(identical_claim_count > 1)

title_per_provider <- 'Number of claims per provider in identical groups, '
fig_base <- to_plot %>%
    ggplot(aes(x = as.factor(identical_claims_per_provider))) +
    xlab('Number of claims per provider') +
    ggtitle(str_c(title_per_provider, 'inpatients'))
plot_bar_charts(fig_base, 'claims')

to_plot <- filter_out_no_codes(outpatient_claims) %>%
    filter(identical_claim_count > 1)

fig_base <- to_plot %>%
    ggplot(aes(x = identical_claims_per_provider)) +
    scale_x_continuous('Number of claims per provider',
                       breaks = seq(from = 1.0, to = 2.2, by = 0.2)) +
    ggtitle(str_c(title_per_provider, 'outpatients'))
plot_histograms(fig_base, y_label = 'claims', bins = 30,
                breaks = seq(from = 20e3, to = 100e3, by = 20e3))

3.8 Time series

Summary

Plots of weekly visit counts were used to choose the range of dates included in time-series decompositions.

The daily visit counts show a small weekly seasonality. For instance, the most common day for the start of an outpatient visit is Monday, and the least common day is Friday.
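Before any time-series decomposition, a day-of-week effect can be checked with a plain count of claim start dates by weekday. The sketch below uses made-up dates and is a simpler stand-in for, not a reproduction of, the decomposition used in this section.

```r
# Toy claim start dates; 2009-01-05 was a Monday.
claim_start <- as.Date(c('2009-01-05', '2009-01-05', '2009-01-06',
                         '2009-01-09', '2009-01-12'))

# '%u' gives the ISO weekday number (1 = Monday, ..., 7 = Sunday),
# which avoids locale-dependent day names.
weekday_number <- as.integer(format(claim_start, '%u'))

# Fix the levels so that weekdays with no visits still appear with count 0.
visits_by_weekday <- table(factor(weekday_number, levels = 1:7))
```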

Weekly visit count

From the plot of weekly inpatient visits, it appears that the first and last weekly dates with complete data may be 2008-12-29 and 2009-12-21, respectively.

For the outpatient data, the period in which weekly data appears complete is 2009-01-05 to 2009-12-21.

Outside of these ranges, it’s possible that the data is incomplete, and so the time-series decompositions will include only the data in these ranges.

for (claim_type in claim_types) {
    plot_weekly_counts(claims, claim_type)
}

Daily visit count

for (claim_type in claim_types) {
    date_range <- valid_date_ranges[[claim_type]]
    to_plot <- claims[[claim_type]] %>%
        extract_series_data(date_range)

    title <- str_c('Daily visit count, ', claim_type, 's')
    fig <- plot_series(to_plot, title)
    print(fig)

    stl_model <- get_stl_model(to_plot)
    title <- str_c(
        'Decomposition of daily-visit time series, ', claim_type, 's'
    )
    print(plot(stl_model, main = title))

    title <- str_c('Weekly seasonality in visit count, ', claim_type, 's')
    fig <- plot_seasonality(to_plot, stl_model, title)
    print(fig)

    trend_curve <- trend(stl_model)
    trend_data <- data.frame(date = to_plot$ClaimStartDt, trend = trend_curve)
    title <- str_c('Trend in visit count, ', claim_type, 's')
    fig <- trend_data %>%
        ggplot(aes(x = date, y = trend)) +
        geom_line(color = 'navyblue') +
        xlab('Date') +
        ylab('Number of visits') +
        ggtitle(title)
    print(fig)
}

By initial diagnosis

Does the weekly seasonality depend on the reason for a hospital visit (as determined by the initial diagnosis code)?

It is tempting to interpret the weekly seasonalities shown below as meaningful, but the variation during the week is quite small, and the number of samples is much smaller than for the full sets of inpatient or outpatient claims. Further analysis would be needed to determine whether these apparent seasonalities are just random noise.

for (claim_type in claim_types) {
    date_range <- valid_date_ranges[[claim_type]]
    frequent_codes <- freq_admit_codes[[claim_type]] %>%
        arrange(desc(count))
    claim_data <- claims[[claim_type]] %>%
        filter(ClmAdmitDiagnosisCode %in% frequent_codes$code)

    for (admit_code in frequent_codes$code) {
        description <- code_descriptions[admit_code]
        series_data <- claim_data %>%
            filter(ClmAdmitDiagnosisCode == admit_code) %>%
            extract_series_data(date_range)

        title <- str_c('Daily visit count for ', tolower(description), ', ',
                       claim_type, 's')
        fig <- plot_series(series_data, title)
        print(fig)

        stl_model <- get_stl_model(series_data)

        title <- str_c('Weekly seasonality for ', tolower(description), ', ',
                       claim_type, 's')
        fig <- plot_seasonality(series_data, stl_model, title)
        print(fig)
    }
}

4 Conclusion: fraud associations

The strongest predictor of fraud in the dataset is the number of claims submitted by a provider. The fraudulent providers, which represent a relatively small percentage of the total, are responsible for a substantial percentage of claims submitted.

For inpatient claims, both doctors and patients can be roughly classified into two groups: those with no fraudulent claims and those with only fraudulent claims.

In addition to this broad generalization, the analysis shows some subtler patterns in fraud associations involving patients.

Duplication of claims is a promising area for further investigation.